This is a quick walkthrough demonstrating how to generate SWNE plots alongside the Pagoda2 pipeline using a 3k PBMC dataset as an example.

To save time we will be using the pre-computed Pagoda2 object pbmc3k_pagoda2.Robj, which can be downloaded here.

First let’s load the required libraries

library(pagoda2)
library(swne)
library(Matrix)

Next let’s load the Pagoda2 object

p2 <- readRDS("/media/Home_Raid1/yan/swne/Data/pbmc3k_pagoda2.Robj")

Most scRNA-seq pipelines only use a subset of highly overdispersed genes for analysis. We’ll pull out those variable genes here, as well as the cluster labels.

## Pull out variable genes
n.od.genes <- 1.5e3
var.info <- p2$misc$varinfo; var.info <- var.info[order(var.info$lp, decreasing = F),];
od.genes <- rownames(var.info[1:n.od.genes,])
length(od.genes)
## [1] 1500
## Pull out clusters
clusters <- p2$clusters$PCA$multilevel
levels(clusters)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"

The easiest way to generate an SWNE embedding is to use the wrapper function RunSWNE

## Run SWNE
genes.embed <- c("MS4A1", "GNLY", "CD3E", "CD14",
                 "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
swne.embedding <- RunSWNE(p2, k = 14, var.genes = od.genes, genes.embed = genes.embed)
## calculating variance fit ... using gam [1] "1500 variable genes to use"
## running PCA using 1500 OD genes ..
## Loading required package: irlba
## .. done
## Plot SWNE
PlotSWNE(swne.embedding, alpha.plot = 0.4, sample.groups = clusters,
         do.label = T, label.size = 3.5, pt.size = 1.5, show.legend = F,
         seed = 42)

Now we’ll go through the SWNE embedding process step by step

First, let’s pull out the counts, and log-scale and adjust the variance of each gene

norm.counts <- ExtractNormCounts(p2, obj.type = "pagoda2", rescale = T, rescale.method = "log")
## calculating variance fit ... using gam
dim(norm.counts)
## [1] 8545 2696

We use the FindNumFactors function to identify the optimal number of factors to use. This function can be slow for large datasets, since it iterates over different values of k, so a simple “hack” is to just set k equal to the number of significant principal components.

k.range <- seq(2,16,2) ## Range of factors to iterate over
k.err <- FindNumFactors(norm.counts[od.genes,], k.range = k.range, n.cores = 8, do.plot = T)

We then run the NMF decomposition. We can initialize the NMF using either Independent Component Analysis (ica), Nonnegative SVD (nnsvd), or a completely random initialization. ICA is the best option for most datasets. The output of RunNMF is a list of the gene loadings (W) and NMF embedding (H).

k <- 14
nmf.res <- RunNMF(norm.counts[od.genes,], k = k, init = "ica", n.cores = 8)
nmf.scores <- nmf.res$H

Compute the SNN matrix from the PCA embedding

p2$calculatePcaReduction(nPcs = 20, odgenes = od.genes)
pc.scores <- t(p2$reductions$PCA)
snn <- CalcSNN(pc.scores, k = 20, prune.SNN = 1/20)

Runs the SWNE embedding. The three key parameters are alpha.exp, snn.exp, and n_pull, which control how the factors and neighboring cells affect the cell coordinates.

alpha.exp <- 1.25 # Increase this > 1.0 to move the cells closer to the factors. Values > 2 start to distort the data.
snn.exp <- 1.0 # Lower this < 1.0 to move similar cells closer to each other
n_pull <- 3 # The number of factors pulling on each cell. Must be at least 3.
swne.embedding <- EmbedSWNE(nmf.scores, snn, alpha.exp = alpha.exp, snn.exp = snn.exp,
                            n_pull = n_pull, proj.method = "umap", dist.use = "cosine")

For now, let’s hide the factors by setting their names to the empty string "". We’ll interpret them later

swne.embedding$H.coords$name <- ""

To help with interpreting these cell clusters, let’s pick some key PBMC genes to embed.

genes.embed <- c("MS4A1", "GNLY", "CD3E", "CD14",
                 "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")

Since we only ran NMF on the overdispersed genes, we need to project the rest of the genes onto the NMF projection to get gene loadings for all genes.

nmf.res$W <- ProjectFeatures(norm.counts, nmf.scores, n.cores = 8)

Now we can embed the key PBMC genes onto the visualization and remake the plot

swne.embedding <- EmbedFeatures(swne.embedding, nmf.res$W, genes.embed, n_pull = 3)

Let’s make the SWNE plot with the key genes embedded. The closer a cell or a cluster is to a gene, the higher the expression level. We set a seed for reproducible cluster colors, so that every plot will use the same colors to label the clusters.

PlotSWNE(swne.embedding, alpha.plot = 0.4, sample.groups = clusters, do.label = T,
         label.size = 3.5, pt.size = 1.5, show.legend = F, seed = 42)

We can validate this by overlaying the expression of one of these key genes onto the plot.

gene.use <- "CD8A"
gene.expr <- norm.counts[gene.use,]
FeaturePlotSWNE(swne.embedding, gene.expr, gene.use, alpha.plot = 0.4, label.size = 3.5, pt.size = 1.25)

We can also make a t-SNE plot for comparison.

tsne.scores <- p2$embeddings$PCA$tSNE
PlotDims(tsne.scores, sample.groups = clusters, pt.size = 0.75, label.size = 3.5, alpha = 0.3,
         show.legend = F, seed = 42, show.axes = F)

We can also interpret the factors by using the gene loadings matrix. Here, we extract the top 3 genes for each factor by gene loading. Since NMF tends to create a parts-based representation of the data, the factors often correspond to key biological processes or gene modules that explain the data.

gene.loadings <- nmf.res$W
top.factor.genes.df <- SummarizeAssocFeatures(gene.loadings, features.return = 3)
head(top.factor.genes.df)
##   assoc_score feature   factor
## 1   0.4878755    CCL5 factor_1
## 2   0.3388011     FTL factor_1
## 3   0.3272778    CD8A factor_1
## 4   0.6540625     LYZ factor_2
## 5   0.4186960    FTH1 factor_2
## 6   0.3507410     FTL factor_2

And finally, we can make a heatmap to visualize the top factors for each gene

gene.loadings.heat <- gene.loadings[unique(top.factor.genes.df$feature),]
ggHeat(gene.loadings.heat, clustering = "col")

Extract cluster colors for compatibility with other plotting methods (i.e. Monocle)

color.mapping <- ExtractSWNEColors(swne.embedding, sample.groups = clusters, seed = 42)
color.mapping
##         7         1         8         2         4         6         3 
## "#CD9600" "#FF61CC" "#F8766D" "#7CAE00" "#C77CFF" "#00A9FF" "#00BFC4" 
##         5 
## "#00BE67"